Zamani et al. BMC Bioinformatics 2014, 15:227 
http://www.bionnedcentral.conn/1471 -21 05/1 5/227 



Bioinformatics 



SOFTWARE Open Access 



A universal genomic coordinate translator for 
comparative genomics 

Neda Zamani^'', Corel Sundstr6m\ Jennifer RS Meadows\ Marc P H6ppner\ Jacques Dainat\ Henrik Lantz\ 
Brian J Haas^ and Manfred G Grabherr^'^" 



Abstract 

Background: Genomic duplications constitute major events in the evolution of species, allowing paralogous copies 
of genes to take on fine-tuned biological roles. Unambiguously identifying the orthology relationship between 
copies across multiple genomes can be resolved by synteny, i.e. the conserved order of genomic sequences. 
However, a comprehensive analysis of duplication events and their contributions to evolution would require 
all-to-all genome alignments, which increases at with the number of available genomes, N. 

Results: Here, we introduce Kraken, software that omits the all-to-all requirement by recursively traversing a graph 
of pairwise alignments and dynamically re-computing orthology. Kraken scales linearly with the number of targeted 
genomes, N, which allows for including large numbers of genomes in analyses. We first evaluated the method on 
the set of 12 Drosophilo genomes, finding that orthologous correspondence computed indirectly through a graph 
of multiple synteny maps comes at minimal cost in terms of sensitivity, but reduces overall computational runtime 
by an order of magnitude. We then used the method on three well-annotated mammalian genomes, human, 
mouse, and rat, and show that up to 93% of protein coding transcripts have unambiguous pairwise orthologous 
relationships across the genomes. On a nucleotide level, 70 to 83% of exons match exactly at both splice junctions, 
and up to 97% on at least one junction. We last applied Kraken to an RNA-sequencing dataset from multiple 
vertebrates and diverse tissues, where we confirmed that brain-specific gene family members, i.e. one-to-many or 
many-to-many homologs, are more highly correlated across species than single-copy (i.e. one-to-one homologous) 
genes. Not limited to protein coding genes, Kraken also identifies thousands of newly identified transcribed loci, 
likely non-coding RNAs that are consistently transcribed in human, chimpanzee and gorilla, and maintain significant 
correlation of expression levels across species. 

Conclusions: Kraken is a computational genome coordinate translator that facilitates cross-species comparisons, 
distinguishes orthologs from paralogs, and does not require costly all-to-all whole genome mappings. Kraken is 
freely available under LPGL from http://github.com/nedaz/kraken. 

Keywords: Comparative genomics. Genomic coordinate translation. Genomic duplication. Cross-species gene 
expression analysis 



Background 

An organisms genome contains a collection of genes and 
regulatory elements that are located in particular order and 
orientation. The organization and relative distance of these 
features from one another can direct their activity patterns. 
A classic example includes the Hox gene clusters [1,2], 
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which are preserved as four to eight distinct regions among 
vertebrates [3]. The Hox proteins and their regulatory ma- 
chinery control organism development, and while the dif- 
ferent copies share sequence similarity with each other, it is 
the spatial organization that determines the timing of ex- 
pression. The Hox genes have undergone several duplica- 
tion and expansion events during evolutionary history, 
allowing for more specialized roles in fine-tuning develop- 
ment, facilitating increased organismal complexity [4]. 
More generally, gene duplications are a common mechan- 
ism for subsequent sub- and neo-fianctionalization, both in 
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terms of proteins, as well as in regulation. For example, up 
to two thirds of vertebrate genes are members of families 
[5], which have been generated by local duplications, block 
duplications and whole genome duplications. 

For a comprehensive comparative genomics study, it is 
thus paramount to accurately identify orthologous se- 
quences that arose through duplication events, both 
prior to and after speciation. Due to complex patterns of 
selective pressure, resulting in conservation of se- 
quences, loss of sequences, as well as invention of new 
functions, nucleotide or protein similarity is not a reli- 
able measure to unambiguously resolve orthologs [6]. 
However, orthologs can be recognized through conserved 
synteny, i.e. the locally conserved order and orientation 
of features, which has been observed even in species 
with high turnover rates of gene duplication, expansion 
and loss, such as in mammals [7]. Synteny alignments are 
either computed relative to one central genome, as for ex- 
ample human, as described in the 29 mammalian genomes 
project [8], or via a complete set of pairwise comparisons, 
where the computational time for the analysis of N ge- 
nomes is in the order of O(N^). Here, we describe a novel 
computational method, Kraken, which provides both the 
independence of a central genome, as well as eliminating 
the time consuming step of generating all pairwise synteny 
maps. For setting up a synteny framework, Kraken uses 
maps generated by the genome-wide synteny alignment 
programs and methods such as LASTZ chained align- 
ments [9], Mummer [10], SyMAP [11], Satsuma [12], 
SynMap [13], or alignment graphs, for example Enredo/ 
Pecan [14], and HAL [15]. Next, Kraken infers indirect 
syntenic relationships between two genomes not connected 
through a synteny map via indirection, i.e. the mapping of 
regions through a graph of synteny maps and by dynamic- 
ally augmenting local alignments. 

As such, Kraken constitutes a core utility aiming at re- 
solving several bioinformatics challenges currently facing 
life sciences. In particular, high-throughput sequencing 
technologies generate millions of reads from RNA, 
which, in turn, predict tens or hundreds of thousands of 
transcribed loci, each of which can contain multiple iso- 
forms. To assess the quality and biological relevance of 
each prediction, additional evidence is required. An au- 
tomated utility, which compares transcriptional activity 
in orthologous regions across multiple species, can pro- 
vide such evidence, in particular for novel loci that 
might have a function that is yet to be determined. 

In the following sections, we describe Kraken s imple- 
mentation, and evaluation of the accuracy, sensitivity, 
and specificity on a set of 12 Drosophila genomes span- 
ning a long range of genomic distances [16] and the 
well-annotated mouse, human, and rat genomes. We 
then apply Kraken to rapidly and automatically establish 
transcription orthology maps across multiple vertebrate 



species for known and novel loci, leveraging previously 
generated RNA-Sequence data [17]. 

Implementation 

An overview of Kraken s workflow is shown in Figure la: 
the input is comprised of (i) genome sequences, in 
chromosome or scaffold coordinates; (ii) synteny maps, 
generated from synteny alignment programs or extracted 
from multiple alignment graphs; and (iii) annotations or 
feature coordinates, specified in Gene Transfer Format 
(GTF). For each genome, Kraken s output is provided on 
two levels: (i) GTF files in translated coordinates, pre- 
serving all of the original information; and, if applicable, 
(ii) a list of spatial relationships between overlapping 
translated annotations and native annotations. Figure lb 
shows a flow chart of how the input data is processed. 
After all genomes, synteny maps, and input coordinates 
are loaded into memory, each query coordinate is trans- 
lated into the genomic coordinate system of the target 
genome in a multi-step process: (i) estimate candidate 
locations of orthologous coordinates through the syn- 
teny graph; (ii) perform a rapid alignment of the input 
sequence against the target region based on a cross- 
correlation algorithm; and (iii) compute a local sequence 
alignment to determine the exact target coordinates. Op- 
tionally, (iv) coordinates in target coordinates are com- 
pared against a reference GTF, accommodating for features 
with multi-exonic structures and multiple isoforms. In the 
following, we describe each step in detail. 

Configuration 

Kraken reads the meta-data accompanying a dataset from a 
configuration file. This file lists the file locations of the ge- 
nomes (FASTA format), the synteny maps (LASTZ general 
format and Satsuma format, Kraken provides conversion 
tools for other formats) and the genomes they connect and 
in what direction. Kraken requires the pre-computed pair- 
wise synteny maps to provide the coordinates of ortholo- 
gous regions (i.e. not the synteny alignments), in intervals 
specifying the corresponding: (i) chromosome or scaffold 
name; (ii) first nucleotide position in the interval; (iii) last 
nucleotide position in the interval. Maps need to be pro- 
vided in one direction (genome A genome B) only, and 
are duplicated and inverted in memory to supply both di- 
rections (genome B genome A). All positions (start and 
stop) in the synteny maps are sorted by chromosome first 
and position next in target coordinates, setting up a data 
structure for binary search. 

Synteny graph 

Kraken performs an exhaustive search through all possi- 
bilities from source to target genomes through a synteny 
graph built from pairwise synteny as specified in the con- 
figuration file, and selects the path with either (a) the 
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Figure 1 Overview of Kral<en. (a) Showing Kraken's input and output: genomes are supplied in FASTA format, synteny maps are in LASTZ 
general format, computed by any synteny based aligner such as LASTZ chained alignments [9], Satsuma [12], SyMAP [1 1], Enrodeo/Pecan 
alignment graphs [14]. Annotations are provided in Genome Transfer Format (GTF) files for one or more genomes, and can be either curated 
annotations, or experimentally found sequences, e.g. through Tophat/Cufflinks [18] or Trinity/PASA [19]. As output, Kraken lists the input GTF file 
items in output genome coordinates, as well as spatial relationships between translated features and features that are native to the genome. 
Features are hierarchically organized into loci, which are sets of transcripts, which are collections of exons. (b) Technical overview of Kraken's 
workflow. Genomes and synteny maps are loaded, and a complete set of paths connecting each genome to each other is computed prior to 
processing. For each annotation, Kraken finds an exact or approximate orthology match in each other genome, and runs a two-step local 
alignment to determine the boundaries of the orthologous feature. Once completed, Kraken examines the locus-transcript-exon structure of the 
translated input and compares it, if available, to annotations native to the target genome. 



lowest number of indirect mappings, or (b) by minimizing 
the accumulated genomic distances, if specified. The path 
is determined prior to coordinate translation and fixed 
from there on. More formally, the synteny graph G(V,E) is 
an undirected graph, where the nodes, V(G), consist of the 
genomes and the pairwise synteny alignments between 
the genomes constitute the edges of the graph, E(G). The 
graph edges are weighted by genomic distances if avail- 
able, and otherwise they are non-weighted. For a given 
pair of genomes, (u, v), the shortest path is found by min- 
imizing the edge distances on the path. 

Coordinate translation 

Each interval specified in the source GTF is translated 
individually by looking up the lower bound of the start 
and higher bound of the end position. The coordinate 



start and stop can either directly fall into syntenic an- 
chors, or in between anchors, in which case the candi- 
date interval is widened to the next adjacent syntenic 
map entries. This process is repeated until the target 
genome is reached. Translated target coordinates on the 
same chromosome or scaffold with syntenic flanks in 
consistent orientation of up to 100,000 nt are passed on 
to the next step, otherwise, the region is split into two 
target intervals, one from each side (50,000 nt each) to 
allow for searching the boundaries of syntenic breaks. 

Rapid alignment 

For computational efficiency, Kraken performs a quick 
search of the source sequence against the target interval 
by employing an approximate cross-correlation alignment, 
as originally implemented by Satsuma [12]. To limit the 



Zamani et al. BMC Bioinformatics 2014, 15:227 
http://www.bionnedcentral.conn/1471 -21 05/1 5/227 



Page 4 of 1 3 



processing time and memory requirements of the under- 
lying Fourier Transform, the target interval is broken into 
overlapping sequences of 2^^ nucleotides in size, the 
source sequence is cross-correlated against each block, 
and the block with the highest absolute cross-correlation 
signal is computed. Based on the size of the source inter- 
val, Kraken determines a candidate region equal to that 
size plus flanks on each end (+/- 12 nt per default) based 
on the offset of the cross-correlation signal. 

Dynamic local re-alignment 

Detailed alignment of the source sequence is performed 
using the Cola [20] implementation of a banded alignment 
with gap-affine penalties [21,22] against the target sub- 
sequence, which is defined by the offset determined by the 
rapid alignment. For source intervals of >100 nt, the se- 
quence is split into two 100 nt chunks at each end cover- 
ing the start and end region of the sequence, both in the 
source and the target genomes. A p-value threshold to 
accept alignments is configurable and defaults to 10"^. 
Alignments are not required to cover the entire source se- 
quence, i.e. nucleotide mismatches at the boundaries are 
permissible (as forcing alignments would infer the risk of 
false insertions). In that case, the final translated target co- 
ordinates are estimated based on the alignment offset into 
the source sequence, i.e. if the alignment starts at position 
k in the source feature, the target start coordinate is ad- 
justed by -k nucleotides (and vice versa for the final stop 
coordinate). Output is provided in GTF format, where for 
all the items that were successfiilly translated an output 
entry is produced containing the translated coordinates. 

Feature matching 

Optionally, Kraken allows for directly comparing translated 
coordinates to features specified in the coordinates of the 
target genome, following the GTF file convention for 
exons, transcripts, and genes. Kraken stores GTF coordi- 
nates internally in the following data structures (classes): 
(i) exons, which are intervals (implemented as annotation 
items), consisting of a chromosome name, start and end 
coordinate; (ii) transcripts, which are generalizations of 
items and extend functionality by owning a number of 
exons; and (iii) loci, which also generalize items but own 
a set of transcripts. Kraken instantiates arrays of each 
type in sorted order for efficient retrieval, and stores own- 
ership relationships between items, transcripts and loci bi- 
directionally to facilitate fast referencing in either direction. 
Thus, Kraken allows for rapidly inferring spatial relation- 
ships between the genomic features in the source and tar- 
get genome, if the latter are available in GTF format, and 
taking into account their multi-exonic structures. Kraken 
classifies matches as: (i) full sense overlap, i.e. all exons of 
the source transcript overlap all exons of the target tran- 
script and fall on the same strand; (ii) partial sense overlap. 



i.e. one or more exon overlap in sense direction; (iii) 
intronic (sense or antisense), i.e. the coordinates of the 
source transcript overlap an intron of a target locus; and 
(iv) antisense (full or partial), i.e. overlapping target exons 
in the opposite strand. The coordinates of all translated 
features, the relationships described above, and the over- 
lapping target annotations are reported in human-readable 
outputs that are also friendly to machine parsers. 

Results and discussion 

Translating genomic intervals between 12 fruit fly 
species: evaluating synteny graphs 

To examine how translation through intermediate syn- 
teny maps along a graph impacts sensitivity, we evalu- 
ated Kraken on 264 pairwise comparisons between the 
genomes of 12 Drosophila fruit flies [16]. Drosophila 
species are a diverse group (Figure 2a) and cover small 
to large genomic timespans (Figure 2b), allowing for 
measuring accuracy as a function of genomic distance. 
Moreover, the genomes are of moderate size (-150 Mb), 
so that computing a full set of pairwise synteny maps as 
the baseline for comparison is computationally feasible. 
We first generated all pairwise genome-wide syntenic 
alignments using Satsuma [12]. We next selected 75,000 
random genomic intervals of 200 base pairs in size for 
each genome, and used Kraken to translate these se- 
quences into the coordinates of all other genomes using 
three different topologies: (i) a star configuration with 
D.melanogaster in the center (Figure 2c); (ii) a star con- 
figuration with D.sechiUia in center (Figure 2d); and (iii) 
a clade configuration (Figure 2e) roughly following spe- 
cies phylogeny (Figure 2a). Table 1 summarizes the re- 
sults. We first observed that for closely related species, 
such as D.yakuba and D.erecta, more than 80% of blocks 
could be successfully translated, while this fraction 
drops to about 12% for more distantly related species 
such as D.simulans and D.wilstoni. However, genomic 
distance does not majorly impact the fraction of suc- 
cessful indirect to direct translations: for example, for 
the closely related species pair D.erecta/D.yakuba in the 
melanogaster star configuration, 97.9% of directly trans- 
lated blocks are matched in the indirect translation, 
while this fraction is only slightly lower for the more 
distantly related pairs D.erecta/D.mojavensis (96.9%) 
and D.erecta/D.willstoni (95.5%). The median fraction 
of identical indirect/direct translations is highest in the 
D.melanogaster star topology (Figure 2c), followed by 
the D.sechellia star (Figure 2d), and clade (Figure 2d) 
topologies. While overall, using the high-quality genome 
of D.melanogaster as the center yields the best results 
with respect to the topologies evaluated here, individual 
statistics suggest that a more complex topology allowing 
for alternative paths to traverse the synteny graph yields 
higher sensitivity (Table 1). 
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Figure 2 Synteny topologies for indirect feature mappings in 12 drosophila species, (a) The piiylogeny of 12 Drosophila species is sliown, 
as inferred by [16], and (b) a comparison of evolutionary distances between fruit flies and vertebrates, (c) Shown is a Kraken star configuration 
\N\th D.melanogaster, the most complete genome, in the middle, where indirect translations are forced to map through the D.melanogaster 
genome, (d) Alternatively, a star configuration centered around D.sechellia is shown, as well as (e) a more complex configuration loosely 
modeled on species phylogeny and with three genomes serving as transition points. 



Matching genes between human, rat, and mouse: a 
quantitative assessment 

We evaluated Krakens quantitative performance with 
regards to known genomic features by matching trans- 
lated gene structures across three well-annotated high- 
quality mammalian genomes: human (hgl9), mouse 
(mmlO), and rat (rn5). Using pairwise synteny maps be- 
tween the genomes generated by LASTZ [9], we trans- 
lated all annotated genes (Ensembl 68) in six pairwise 
comparisons across genomic coordinates. Table 2 sum- 
marizes the results: overall, between 77% and 95% of 
transcripts could be unambiguously translated, with be- 
tween 90% and 96% of these overlapping with annotated 
loci in the target genome. This fraction is higher for the 
protein-coding set, ranging from 93% to 98%, likely due 
to higher sequence conservation and sequence similarity. 



Kraken matched more annotated transcripts between 
human and mouse than human and rat, possibly because 
of differences in the quality of the genome builds and/or 
annotation. Among the 1,760 protein coding gene loci 
that failed to be translated from human to mouse, more 
than half (926) can be divided into the following five cat- 
egories: (i) 483 uncharacterized and hypothetical pro- 
teins; (ii) 188 loci with predicted open reading frames 
but without functional prediction; (iii) 113 pseudogenes 
from ribosomal proteins; (iv) 56 zinc finger proteins; 
(v) 51 olfactory receptors; and (vi) 35 keratin associated 
proteins. While the first two categories represent genes 
of uncertain annotation status, the latter are comprised 
of members of gene families that are highly variable in 
copy number, or genes that have been known to undergo 
lineage-specific expansions. Figure 3 shows three-way 
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Table 1 Comparison of direct and indirect pairwise coordinate translations for the M-Drosophila dataset 
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64934 


46720 


72% 


45819 


97.2 


46188 


96.5 


45819 


97.2 


D.vir 


D.yak 


63073 


8988 


14% 


8781 


93.5 


8851 


93.6 


8649 


91.4 


D.wil 


D.yak 


48969 


6204 


13% 


5684 


87.5 


5748 


87.8 


5061 


76.8 



Median Mapping Ratio (%) 97.4 93.6 91.7 

Each row shows results for blocks of 200 nucleotides randomly chosen from the source genome. Results are shown for three different configurations for indirect 
translation that are depicted in Figure 2, namely, D.melanogaster-star, D.sechellia-star and the clade-center configuration, where in each case we report two items: 
(i) total number of blocks translated through the specified configuration (indirect) and (ii) the fraction of directly translated blocks that are matched by identical 
indirect translations. To compact the results, pairwise comparisons where at least one configuration would result in a direct comparison have not been included 
here. Also, the species are shown in abbreviated format; see Figure 2 for equivalent complete species name. 
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Table 2 Results of translating transcripts between the human, mouse, and rat genomes shown for all transcripts and 
separately for coding transcripts only 



Target source Human Mouse Rat 



Human 


All Transcripts Mapped 




152237 (83.3%) 


148655 (81.4%) 




All Transcripts With Overlap 




141702 (93.1%) 


135324 (91.0%) 




Coding Transcripts Mapped 




75531 (97.1%) 


74238 (95.4%) 




Coding Transcripts With Overlap 




74356 (98.4%) 


72069 (97.1%) 


Mouse 


All Transcripts Mapped 


73448 (80.7%) 




86420 (95.0%) 




All Transcripts With Overlap 


70243 (95.6%) 




75606 (89.4%) 




Coding Transcripts Mapped 


42506 (90.7%) 




45802 (97.8%) 




Coding Transcripts With Overlap 


41704 (98.1%) 




43483 (94.9%) 


Rat 


All Transcripts Mapped 


30396 (76.8%) 


35361 (89.4%) 






All Transcripts With Overlap 


29368 (96.6%) 


31840 (90.0%) 






Coding Transcripts Mapped 


28754 (87.2%) 


31313 (94.9%) 






Coding Transcripts With Overlap 


27974 (97.3%) 


29025 (92.7%) 






All Transcripts Total 


182723 


90956 


39549 




Coding Transcripts Total 


77808 


46836 


32971 



For each comparison instance we show two Items: (1) the total number of transcripts that Kraken successfully translates and (II) the number of successfully 
translated transcripts for which we find an exon overlap based on comparison with a reference annotation. We also show In brackets: (I) the percentage of 
translated transcripts to the total number of transcripts and (II) the ratio of overlapping transcripts to the total number translated. 



Venn diagrams based on maximal overlaps (i.e. one-to-one 
relationships between transcripts in different genomes) for 
all transcripts (Figure 3a), and for the protein-coding sub- 
sets (Figure 3b). As in the pairwise comparisons, the frac- 
tion of overlaps is higher for protein coding genes. 
Notably, the rat annotates the lowest number of isoforms 
per protein coding gene, however, more than 90% of those 
overlap isoforms in both human and mouse, and almost all 
overlap mouse transcripts, suggesting that the rat annota- 
tion mostly contains dominant isoforms. We attribute the 
large fraction of 'human-only transcripts to the more than 



three-fold number of annotated alternative isoforms per 
human gene locus. 

We next compared the results to the methods HalLift- 
over [15] and RATT [23], using the Ensembl Compara 
database [24], a manually curated set of orthologous re- 
lationships between protein-coding transcripts across 
species for human, as independent metrics (Table 3). All 
methods largely agree in their predictions, with Kraken 
consistently and on all data sets exhibiting the highest 
overlap with ComparaDB, as well as the lowest number 
of transcripts predicted by ComparaDB only (Table 3). 




Figure 3 Venn diagrams of transcript overlaps between human, mouse, and rat. (a) All annotated transcripts, which include all isoforms 
of genes, are shown for each species in circles. Two-way and three-way spatial matches between transcripts in different species are shown as 
overlaps in the middle, counting only maximal matches (isoforms for which the largest number of exons overlap across genomes). In human, the 
number of annotated isoforms is about two-fold higher than in mouse and four-fold higher than in rat. (b) Counts and cross-species overlaps of 
transcripts from only protein coding loci are shown. 

V J 
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numbers of amino acids across species, indicating that 
the predicted coordinates could reflect actual biological 
differences in transcription between species. By contrast, 
we did not observe this pattern for exons of long inter- 
genie non-coding genes (Figure 4b), which further sup- 
ports that the periodicity observed for protein coding 
genes is biologically valid, rather than stemming from al- 
gorithmic artifacts, such as systematic alignment biases. 

Resolving transcribed single-copy genes and gene family 
orthologs in vertebrates 

In order to illustrate Kraken s power as a practical tool for 
processing large and complex datasets, we re-analyzed 
publicly available RNA-Sequence data obtained from dif- 
ferent vertebrates and multiple tissues [17]. We first 
mapped the RNA-Sequence reads onto the respective ge- 
nomes of human, chimp, gorilla, opossum, and chicken 
using Tophat [18], without reference annotation guidance, 
and allowing for the detection of un-annotated transcripts 
and isoforms. Next, we estimated RPKM expression values 
from the unpaired reads using Cufflinks/Cuffdiff [18] and 
merged experimentally found transcripts with the refer- 
ence annotations (Ensembl 64) using Cuffmerge [18]. We 
then used Kraken to translate the coordinates of tran- 
scribed features through a minimal set of LASTZ-chained 
alignments. A selection of pairwise interspecies transcript 
relationships are summarized in Table 5: overall, the num- 
ber of transcripts that can be translated decreases with 
genomic distance from 250,000 for human-chimp to 
100,000 for chicken-human. Likewise, the fraction of 
protein-coding genes, which are among the most highly 
conserved genomic features, is higher for more distantly 
related organisms. We next investigated whether protein- 
coding single copy genes and gene family members, 
defined by Ensembl [5] based on clustering by protein 
similarity, consistently show distinct expression patterns 
across tissues and species (we excluded genes with RPKM 



Table 4 Data demonstrating precision of translation at nucleotide level, shown for coding sequence, between human, 
mouse, and rat 



Target source 




Human 


Mouse 


Rat 


Human 


Total Items Mapped 




241261 


225012 




Exactly Matched Items 




1 74432 (72.3%) 


156654 (69.6%) 




Exactly Matched at least One Side 




231333 (95.9%) 


214390 (95.3%) 


Mouse 


Total Items Mapped 


201649 




200606 




Exactly Matched Items 


157304 (78.0%) 




166079 (82.8%) 




Exactly Matched at least One Side 


188982 (93.7%) 




195357 (97.4%) 


Rat 


Total Items Mapped 


174516 


180701 






Exactly Matched Items 


132835 (76.1%) 


148839 (82.4%) 






Exactly Matched at least One Side 


160294 (91.9%) 


171099 (94.7%) 





For each comparison three items are shown: (i) total number of coding sequences translated from source to target species, (ii) number of items translated that 
also find an exact match in the reference target annotation (iii) similar to ii but also including those items that exactly match either on their start or stop 
coordinate. For (ii) and (iii), the percentage of category items matched to the total number mapped is given in brackets. 



Table 3 Results of comparing the Human-Mouse-Rat 
transcript overlaps with Ensembl ComparaDB orthologs 
for Kraken, HalLiftover, and RATT 







Common 
in both 


Unique to 
ComparaDB 


Not In 
ComparaDB 


Human to mouse 


Kraken 


124733 


5683 


19378 




HalLiftover 


122754 


7662 


26009 




RA^ 


15665 


114751 


2337 


Human to rat 


Kraken 


118419 


6869 


19233 




HalLiftover 


1 1 7869 


7419 


28859 




RA^ 


14494 


110794 


2114 


Mouse to rat 


Kraken 


66667 


2172 


10624 




HalLiftover 


61337 


7502 


9033 




RATT 


61127 


9394 


7712 



The numbers in each row left to right show (i) the overlap, i.e. transcripts 
predicted by both ComparaDB and the methods Kraken, HalLiftover, and RATT 
respectively; (ii) transcripts found only in ComparaDB; and (iii) transcripts 
found by Kraken, HalLiftover, or RATT, but not ComparaDB. 



We note that RATT, which we ran in species mode' 
[23], yielded similar results to HalLiftover and Kraken 
on the mouse/rat comparison, but considerably fewer 
predictions for the more distantly related human/mouse 
and human/rat data sets. 

Accuracy on nucleotide level: a qualitative assessment 

We next examined the accuracy of Kraken, comparing 
the translated boundaries of exons to gene models native 
to the target genome for the human, mouse, and rat 
dataset (Table 4). For protein coding exons, Kraken 
matches 70 to 83% of complete exons between genomes 
with exact agreement at both splice junctions, and up to 
97% of exon with exact matches in at least one splice 
junction. In examining cases in which the predictions 
did not exactly match the annotations, we found that 
the majority of differences differ by multiples of three 
nucleotides (Figure 4a), which is consistent with varying 
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a) 




£ 1000 



-10 0 10 20 30 40 -40 -30 -20 -10 0 10 

Coordinate basepair difference between mapped and reference annotation item 

Figure 4 Histogram of nucleotide differences between native mouse annotations and predictions from human, (a) After excluding exact 
matches between the native mouse annotations and exon coordinates predicted from translated human annotations, the differences for coding 
genes show a pronounced periodicity of multiples of three nucleotides, consistent with differences in amino acid counts between species, rather 
than mapping or alignment errors, (b) By contrast, non-coding RNAs do not show any periodicity, consistent with these sequences not being 
subject to translation and thus free of constraint to preserve multiplicity. 



values < 1). We computed Spearman s correlation coeffi- 
cients between the tissues of all species-pairs separately 
for the two gene sets, and compared them to each other, 
highlighting the differences (Figure 5): in all comparisons 
of mismatched tissues (e.g. brain versus liver), single copy 
genes are more correlated than orthologous gene family 
members (light grey dots. Figure 5). Moreover, gene family 
paralogs are more highly correlated in matched brain and 
cerebellum tissues compared to single copy genes (black 
dots. Figure 5), while liver and kidney show the opposite 
trend (green and blue dots. Figure 5), with heart showing 
patterns in between (pink dots. Figure 5). In all cases, testis 
is least correlated compared with other tissues (yellow dots. 
Figure 5) both in single copy genes and gene families, and 
for large genomic distances (chicken-human, chicken- 
opossum, opossum-human). Testis genes are also less cor- 
related in cross-species comparisons, while correlation is 



comparable to other tissues in primates (red dots. Figure 5). 
In terms of absolute values for correlations of both single 
copy and multiple copy genes, species are grouped accord- 
ing to phylogeny, including the primate clade, albeit statisti- 
cally weakly (p < 0.086), with 10 out of 15 matched tissue 
comparisons (including male/female pairs) placing human 
closer to chimp than gorilla. Notably, comparisons involv- 
ing kidney indicate higher correlations for human-gorilla 
than human-chimp. 

Expression of novel transcripts in three primates 

Kraken s independence of known annotations or protein 
coding genes allows for the analysis of transcribed loci 
that have not previously been characterized. For ex- 
ample, thousands of long intergenic non-coding RNAs 
have recently been found in human [25], mouse [26], 
zebrafish [27], and dog [28]. In the human genome, we 



Table 5 Pairwise interspecies transcript relationships based on analyzing RNA-Sequence data generated and published 
by Brawand et al. 201 1 



Species in comparison 
Source Target 


Total Source 
transcripts 


Mapped 

transcripts w/hit 


EnsembI 
transcripts 


Protein coding 
transcripts 


Singleton 
transcripts 


Singleton 
loci 


Families 
transcripts 


Families 
loci 


Chicken 


Human 


100823 


51492 


50278 


49944 


13149 


3968 


36795 


8857 




Opossum 




49500 


48535 


48271 


12452 


3702 


35819 


8497 




Chimp 




47357 


46466 


46212 


11786 


3514 


34426 


8233 




Gorilla 




48585 


47604 


47326 


12146 


3632 


35180 


8420 


Opossum 


Human 


162989 


80994 


78492 


77581 


16936 


4246 


60645 


12611 




Chimp 




75338 


73622 


72979 


15960 


3979 


57019 


11581 




Gorilla 




76486 


74620 


73935 


16227 


4049 


57708 


11837 


Human 


Chimp 


253887 


211785 


205096 


180964 


41267 


5486 


139697 


16234 




Gorilla 




216536 


210180 


183659 


41743 


5588 


141916 


16498 



For each source-target genome pair, we list the number of transcripts mapped from the source that overlap with annotations in the target, indicating the number 
of annotated transcripts, and the number annotated as protein coding genes in the target genome. We further list overlaps of single-copy genes and gene family 
members, determined by the genome of the source, with annotated protein coding genes in the target both at the transcript and locus level. 
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Single Copy Gene Correlation Coefficients 



# Liver 
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• Kidney 


Testis/Other Mismatched Tissues 
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• Cerebellum 


• Brain/Cerebellum 



Figure 5 Correlations of single-copy genes and gene family members. We show scatter plots comparing tissue-specific correlations of 
single-copy genes and gene family members on the species pairs: (a) chicken versus human, (b) chicken versus opossum, (c) opossum versus 
chimpanzee, (d) opossum versus human, (e) human versus chimpanzee, and (f) human versus gorilla. In all panels. Spearman's correlation is 
shown on the x-axis for single-copy genes, and on the y-axis for gene families. Tissue comparisons are coded by color (see legend), in all cases, 
brain tissues (labeled as 'brain' and 'cerebellum' are more correlated for gene families (black dots), while mismatched-tissue comparisons, with 
the exception of the brain and cerebellum tissues, are more correlated for single copy genes (light grey dots). 



found a total of 67,580 actively transcribed loci based on 
RNA-Seq data [17], 17,421 of which fell outside of Ensembl 
annotations. Of these, 6,281 and 5,968 were also actively 
transcribed in chimpanzee and gorilla respectively, with 
3,503 transcribed in all three genomes. While the expres- 
sion levels are correlated across species at much lower 



levels than those of known protein-coding genes, matched 
tissue comparisons still exhibit statistically significant cor- 
relations (Figure 6), with testis being the most correlated 
tissue in both human-chimp and human-gorilla. The re- 
ported correlation measures have high statistical signifi- 
cance with a maximum p-value of 10"^. 
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Figure 6 Correlation of un-annotated transcribed features in human, gorilla, and chimpanzee. We show the Spearman's correlation 
between RPKM expression values across tissues of novel transcribed features between human versus chimp (purple) and human versus gorilla 
(orange), ranked by correlation in the latter set. Overall, novel transcripts are more highly correlated between human and gorilla, with, in both 
cases, testis being the most highly correlated tissue, followed by heart, liver, and kidney. The sex of the tissue donors was not found to be 
statistically significant. 



Conclusions 

The analytical power of comparing features across multiple 
genomes has been demonstrated in the past and dates back 
to the early days of modern genomics. More recent 
examples, in which the order and orientation of genomic 
landmarks played a critical role, include describing a whole 
genome duplication event in the baker s yeast Saccharomy- 
ces cerevisiae [29], and several analyses of evolution in the 
chordate and vertebrate lineages (e.g. [30-33]). Recent ad- 
vances in DNA sequencing and assembly, coupled with the 
generation and processing of functional sequence datasets, 
notably RNA-Sequencing, which often result in hundreds 
of thousands of transcripts, highlight the need for a new 
generation of high-throughput computational analysis 
tools. Such tools are required to process: (i) large genomes; 
(ii) large numbers of genomes; (iii) large sets of genomic 
features, including large numbers of paralogous loci; but: 
(iv) without large computational efforts. Here we described 
Kraken, a novel computational method and software that 
fills this niche. Using the genomes of 12 fruit flies, we 
showed that the ability to translate indirectly, i.e. use an 
intermediate genome as a guide and then re-computing 
local alignments, comes at marginal cost in sensitivity, 
but at a substantial gain in computational efficiency: by 
removing the need to statically compute all-to-all synteny 
maps, Kraken scales linearly with the number of genomes 
and allows for the simultaneous analysis of dozens or even 
hundreds of genomes. We next evaluated Kraken on the 
well established and highly scrutinized genomes of human, 
mouse, rat, and showed that mapping orthologous 



sequences is highly accurate in predicting the precise 
boundaries of genomic features. This is particularly rele- 
vant for comparing protein-coding genes, since errors 
of even a single nucleotide can cause erroneous frame 
shifts and stop codons in open reading frames. Thus, 
Kraken can be used with ease to either create or im- 
prove comprehensive annotations through orthology for 
genomes that have little or no validated evidence within 
a few CPU hours, a functionality that is also provided by 
other, albeit more specialized software programs, such 
as RATT [23], GeneMapper [34], and HalLiftover [15]. 

To showcase Kraken s utility for large-scale research pro- 
jects aimed at discovering hitherto unexplored functional 
connections between orthologous members of gene 
families residing in different genomes, we re-analyzed a 
previously published large and comprehensive dataset 
consisting of RNA-Seq data from different tissues and 
multiple vertebrates [17]. We emphasize that Kraken 
completed the analysis presented here within only a few 
hours of wall clock time and with minimal human in- 
volvement, yet yielding a rich set of results that are 
concordant with biological expectations and previous 
reports. In summary, we found that single-copy genes 
are more highly correlated across tissues and species 
than gene family members, which would indicate that 
single copy genes are enriched for fundamental cellular 
function essential to cells of different tissue types. By con- 
trast, paralogs of gene families could have taken on more 
specialized roles, consistent with the concept of duplica- 
tion and subsequent neo- and sub-functionalization [35]. 
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Moreover, orthologous gene family members are more 
highly correlated in brain tissues across species than their 
single-copy gene counterparts. This suggests that a num- 
ber of genes, which were duplicated early in the vertebrate 
lineage, took on specialized functions in the brain, and 
were subsequently fixed in expression patterns, whereas 
we did not observe this consistently in the other tissues 
for which sequence was available. The pattern we see 
would fit with the theory that the complexity of the verte- 
brate brain and nervous system can in part be contributed 
to the redundancy of genes following the two whole gen- 
ome duplications early in vertebrate evolution [36,37]. 
Moreover, in three primates we found overlap of un- 
annotated transcribed regions, likely non-coding RNAs, 
with transcription levels most highly correlated across 
species in testis. Testis has been known to be the most ac- 
tively transcribed tissue [38], and our results suggest that 
transcription does not occur in a random fashion. Long 
non-coding RNAs have recently been shown to take part 
in the circuitry controlling stem cell pluripotency and cell 
differentiation [26], consistent with expression in repro- 
ductive tissues, and indicating that the set of known RNAs 
is still an incomplete subset of the full inventory of all 
such transcribed loci. 

In conclusion, we expect Kraken to dramatically reduce 
computational analysis time when deployed in future large- 
scale comparative studies. For a newly sequenced mamma- 
lian genome, for example, generating synteny maps to a 
single other mammal is sufficient for comparing it to 
dozens of others, at little extra computational cost. More- 
over, the availability of second- and third-generation DNA 
and RNA sequencing technologies have made it possible to 
expand the set of reference genomes and expressed se- 
quences into different branches of life. Successful and rapid 
analyses will thus require a computational framework that 
is easy to use and easy to set up. 
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